Fault-tolerant scheduling on parallel systems with non-memoryless failure distributions
Abstract
As large parallel systems increase in size and complexity, failures are inevitable and exhibit complex space and time dynamics. In real systems, failure rates most often increase or decrease over time. Considering non-memoryless failure distributions, we study a bi-objective scheduling problem of optimizing application makespan and reliability. In particular, we determine whether one can optimize both makespan and reliability simultaneously, or whether one metric must be degraded in order to improve the other. We also devise scheduling algorithms for achieving (approximately) optimal makespan or reliability. When failure rates decrease, we prove that makespan and reliability are opposing metrics. In contrast, when failure rates increase, we prove that one can optimize both makespan and reliability simultaneously. Moreover, we show that the largest processing time (LPT) list scheduling algorithm achieves good performance when processors are of uniform speed. The implications of our findings are the accelerated completion and improved reliability of parallel jobs executed across large distributed systems. Finally, we conduct simulations, using an actual application of biological sequence comparison, to investigate the impact of failures on performance. © 2014 Elsevier Inc. All rights reserved.
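To make the LPT rule named in the abstract concrete, here is a minimal sketch of textbook LPT list scheduling. It is not the paper's own algorithm or code: the abstract does not give the exact variant, so the function name `lpt_schedule`, the `speeds` parameter, and the tie-breaking rules are illustrative assumptions. Jobs are taken in order of decreasing processing time, and each is placed on the processor where it would finish earliest. Setting all speeds to 1.0 gives the identical-processor case.

```python
def lpt_schedule(job_sizes, speeds):
    """Textbook LPT list scheduling (illustrative sketch, not the paper's code).

    job_sizes: amount of work per job.
    speeds:    speed per processor (assumed parameter; all 1.0 = identical).
    Returns (assignment, makespan), where assignment[j] lists the job
    indices placed on processor j.
    """
    finish = [0.0] * len(speeds)          # current finish time per processor
    assignment = [[] for _ in speeds]     # job indices per processor

    # Largest job first; ties broken by job index (an assumption).
    order = sorted(range(len(job_sizes)), key=lambda i: (-job_sizes[i], i))

    for i in order:
        # Pick the processor on which this job would complete earliest;
        # ties broken by processor index (an assumption).
        best = min(
            range(len(speeds)),
            key=lambda j: (finish[j] + job_sizes[i] / speeds[j], j),
        )
        finish[best] += job_sizes[i] / speeds[best]
        assignment[best].append(i)

    return assignment, max(finish)


if __name__ == "__main__":
    # Hypothetical instance: six jobs, one fast and one slow processor.
    plan, makespan = lpt_schedule([7, 5, 4, 3, 3, 2], [2.0, 1.0])
    print(plan, makespan)
```

The sketch covers only the makespan side of the problem. Reliability under a non-memoryless failure law is not modeled here.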
Similar resources
A New Proactive Fault Tolerant Approach for Scheduling in Computational Grid
Grid Computing provides non-trivial services to users and aggregates the power of widely distributed resources. Computational grids solve large scale scientific problems using distributed heterogeneous resources. The Grid Scheduler must select proper resources for executing the tasks with less response time and without missing the deadline. There are various reasons such as network failure, ove...
An Efficient Fault Tolerant Scheduling Approach for Computational Grid
Grid computing serves as an important technology to facilitate distributed computation. Computational grids solve large scale scientific problems using heterogeneous, geographically distributed resources. Problems like dispatching and scheduling of tasks are considered major issues in a computational grid environment. The Grid Scheduler must select proper resources for executing the tasks with l...
A Survey on Fault Tolerance Mechanisms for job scheduling in Grid computing
Grid computing is defined as a hardware and software infrastructure that enables sharing of coordinated resources in a dynamic environment. In grid computing, the probability of a failure is much greater than in parallel computing. Therefore, fault tolerance is an important issue in achieving reliability and availability of resources. When scheduling a job, the resource uses both average f...
Real-time Fault-tolerant Scheduling Algorithm for Distributed Computing Systems
This article proposes a distributed real-time fault-tolerant model, a priority real-time fault-tolerant algorithm, and a distributed real-time fault-tolerant computational architecture. According to this model, the problem of how to schedule a weighted Directed Acyclic Graph (DAG) in a distributed computing system for high reliability can be solved in the presence of multiprocessor faults. When som...
Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid
Grid Computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling are important issues for large scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing that periodically ...
Journal:
- J. Parallel Distrib. Comput.
Volume: 74
Publication date: 2014